Abstract. Two neural networks which are trained on their mutual output bits are
analysed using methods of statistical physics. The exact solution of the dynamics 
of the two weight vectors shows a novel phenomenon: The networks synchronize to 
a state with identical time dependent weights. Extending the models to multilayer 
networks with discrete weights, it is shown how synchronization by mutual learning 
can be applied to secret key exchange over a public channel. 

1 Introduction 

Neural networks learn from examples. This concept has been investigated
extensively using models and methods of statistical mechanics [1,2]. A "teacher"
network presents input/output pairs of high dimensional data, and a
"student" network is trained on these data. Training means that the synaptic
weights adapt to the input/output pairs by simple rules.

When the networks, teacher as well as student, have N weights, the
training process needs of the order of N examples to obtain generalization
abilities. This means that after the training phase the student has achieved
some overlap with the teacher: their weight vectors are correlated. As a
consequence, the student can classify an input pattern which does not belong to
the training set. The average classification error decreases with the number
of training examples.

Training can be performed in two different modes: batch and on-line
training. In the first case all examples are stored and used to minimize the
total training error. In the second case only one new example is used per time
step and then discarded. Therefore on-line training may be considered as a
dynamic process: at each time step the teacher creates a new example which
the student uses to change its weights by a tiny amount. In fact, for random
input vectors and in the limit $N \to \infty$, learning and generalization can
be described by ordinary differential equations for a few order parameters.

On-line training is a dynamic process where the examples are generated
by a static network, the teacher. The student tries to move towards the
teacher. However, the student network itself can generate examples on which
it is trained. When the output bit is moved to the shifted input sequence, the
network generates a complex time series. Such networks are called bit (for
binary) or sequence (for continuous numbers) generators and have recently
been studied in the context of time series prediction.

This work on the dynamics of neural networks, learning from a static
teacher or generating time series by self interaction, has motivated us to
study the following problem: What happens if two neural networks learn
from each other? In the following section an analytic solution is presented
which shows a novel phenomenon: synchronization by mutual learning.
The biological consequences of this phenomenon have not been explored yet, but
we found an interesting application in cryptography: secure generation of a
secret key over a public channel.

In the field of cryptography, one is interested in methods to transmit secret 
messages between two partners A and B. An opponent E who is able to listen 
to the communication should not be able to recover the secret message. 

Before 1976, all cryptographic methods had to rely on secret keys for
encryption which were transmitted between A and B over a secret channel
not accessible to any opponent. Such a common secret key can be used, for
example, as a seed for a random bit generator whose output is added
(modulo 2) to the bit sequence of the message.
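
As a minimal sketch of this classical scheme, the shared key can seed a
pseudo-random generator whose output is added modulo 2 (XOR) to the message.
The function name `xor_stream` and the sample key are assumptions made for
illustration, and Python's `random.Random` is not a cryptographically secure
generator.

```python
import random

def xor_stream(message: bytes, key: int) -> bytes:
    """Add a key-seeded pseudo-random byte stream to the message modulo 2."""
    rng = random.Random(key)  # illustration only: not a secure generator
    return bytes(b ^ rng.randrange(256) for b in message)

shared_key = 123456789  # hypothetical secret key known to both A and B
cipher = xor_stream(b"attack at dawn", shared_key)
plain = xor_stream(cipher, shared_key)  # adding the same stream again decrypts
```

Because addition modulo 2 is its own inverse, the same function encrypts and
decrypts as long as both partners hold the same key.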

In 1976, however, Diffie and Hellman found that a common secret key
could be created over a public channel accessible to any opponent. This
method is based on number theory: Given limited computer power, it is not
possible to calculate the discrete logarithm of sufficiently large numbers.
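
A toy version of such an exchange can be sketched as follows. The modulus,
base and private exponents are assumed illustrative values that are far too
small for real use; only the structure of the exchange is meant to be shown.

```python
# Toy Diffie-Hellman exchange (illustrative sizes only, not secure).
p, g = 2_147_483_647, 7        # public prime modulus 2^31 - 1 and public base (assumed values)
a, b = 1_234_567, 7_654_321    # hypothetical private exponents of A and B

A_pub = pow(g, a, p)           # sent from A to B over the public channel
B_pub = pow(g, b, p)           # sent from B to A over the public channel

key_A = pow(B_pub, a, p)       # A computes g^(a*b) mod p
key_B = pow(A_pub, b, p)       # B computes g^(a*b) mod p
```

An opponent sees p, g, A_pub and B_pub, but recovering a or b from them
requires a discrete logarithm.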

Here we show how neural networks can produce a common secret key by 
exchanging bits over a public channel and by learning from each other. 



2 Dynamic transition to synchronization 

Here we study mutual learning of neural networks for a simple model system:
Two perceptrons receive a common random input vector x and change their
weights w according to their mutual output bit $\sigma$, as sketched in Fig. 1.
The output bit $\sigma$ of a single perceptron is given by the equation

$$\sigma = \mathrm{sign}(w \cdot x)$$    (1)

x is an N-dimensional input vector with components which are drawn from
a Gaussian with mean 0 and variance 1. w is an N-dimensional weight vector
with continuous components which are normalized,

$$w \cdot w = 1$$    (2)

The initial state is a random choice of the components $w_i^{A/B}$,
$i = 1, \ldots, N$, for the two weight vectors $w^A$ and $w^B$. At each
training step a common random input vector is presented to the two networks,
which generate two output bits $\sigma^A$ and $\sigma^B$ according to (1).
Now the weight vectors are updated by the perceptron learning rule:

$$\begin{aligned}
w^A(t+1) &= w^A(t) + \frac{\eta}{N}\, x\, \sigma^B\, \Theta(-\sigma^A \sigma^B) \\
w^B(t+1) &= w^B(t) + \frac{\eta}{N}\, x\, \sigma^A\, \Theta(-\sigma^A \sigma^B)
\end{aligned}$$    (3)

Fig. 1. Two perceptrons receive an identical input x and learn their mutual
output bits $\sigma$.



$\Theta(x)$ is the step function. Hence, a training step with learning rate
$\eta$ is performed only if the two perceptrons disagree. After each step (3),
the two weight vectors have to be normalized.
In the limit $N \to \infty$, the overlap

$$R(t) = w^A(t) \cdot w^B(t)$$    (4)



has been calculated analytically. The number of training steps t is scaled
as $\alpha = t/N$, and $R(\alpha)$ follows the equation



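$$\frac{dR}{d\alpha} = (R+1)\left(\sqrt{\frac{2}{\pi}}\,\eta\,(1-R) - \eta^2\,\frac{\varphi}{\pi}\right)$$    (5)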



where $\varphi$ is the angle between the two weight vectors $w^A$ and $w^B$,
i.e. $R = \cos\varphi$. This equation has fixed points $R = 1$, $R = -1$, and

$$\frac{\eta}{\sqrt{2\pi}} = \frac{1 - \cos\varphi}{\varphi}$$    (6)



Fig. 2 shows the attractive fixed point of (5) as a function of the learning
rate $\eta$. For small values of $\eta$ the two networks relax to a state of
mutual agreement, $R \to 1$ for $\eta \to 0$. With increasing learning rate
$\eta$ the angle between the two weight vectors increases up to
$\varphi = 133^\circ$ for

$$\eta \to \eta_c \cong 1.816$$    (7)



Above the critical rate $\eta_c$ the networks relax to a state of complete
disagreement, $\varphi = 180^\circ$, $R = -1$. The two weight vectors are
antiparallel to each other, $w^A = -w^B$.

Fig. 2. Final overlap R between two perceptrons as a function of learning
rate $\eta$. Above a critical rate $\eta_c$ the time dependent networks are
synchronized. From Ref.
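
The critical values quoted above can be checked numerically. The sketch below
reads the fixed-point condition (6) as
$\eta = \sqrt{2\pi}\,(1 - \cos\varphi)/\varphi$ and looks for the largest
learning rate that still admits a fixed point with $\varphi < 180^\circ$; the
grid resolution and variable names are assumptions.

```python
import numpy as np

# Fixed-point condition (6): eta = sqrt(2*pi) * (1 - cos(phi)) / phi
phi = np.linspace(1e-6, np.pi, 200_001)
eta = np.sqrt(2 * np.pi) * (1 - np.cos(phi)) / phi

i_max = np.argmax(eta)               # largest eta with a fixed point below 180 degrees
eta_c = eta[i_max]
phi_c_deg = np.degrees(phi[i_max])
print(f"eta_c = {eta_c:.3f}, phi_c = {phi_c_deg:.1f} degrees")
```

The maximum of the curve reproduces the values $\eta_c \cong 1.816$ and
$\varphi \approx 133^\circ$ stated in the text.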



As a consequence, the analytic solution shows, well supported by numerical
simulations for N = 100, that two neural networks can synchronize with
each other by mutual learning. Both networks are trained on the examples
generated by their partner and finally reach an antiparallel alignment.
Even after synchronization the networks keep moving; the motion is a kind of
random walk on an N-dimensional hypersphere, producing a rather complex
sequence of output bits $\sigma^A = -\sigma^B$.
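
The dynamics of this section can be explored with a short simulation. The
following is a sketch under stated assumptions, not the simulation code used
for Fig. 2: it applies the output rule (1) and the perceptron learning rule
with renormalization after every update, and the function name, the seed and
the number of steps are hypothetical choices.

```python
import numpy as np

def mutual_learning(eta, n=100, steps=5000, seed=0):
    """Return the final overlap R = w_A . w_B of two mutually learning perceptrons."""
    rng = np.random.default_rng(seed)
    w_a = rng.standard_normal(n)
    w_a /= np.linalg.norm(w_a)
    w_b = rng.standard_normal(n)
    w_b /= np.linalg.norm(w_b)
    for _ in range(steps):
        x = rng.standard_normal(n)               # common Gaussian input vector
        s_a = 1.0 if w_a @ x >= 0 else -1.0      # output bits, Eq. (1)
        s_b = 1.0 if w_b @ x >= 0 else -1.0
        if s_a != s_b:                           # train only on disagreement
            w_a = w_a + (eta / n) * x * s_b      # each network learns the partner's bit
            w_b = w_b + (eta / n) * x * s_a
            w_a /= np.linalg.norm(w_a)           # renormalize after each step
            w_b /= np.linalg.norm(w_b)
    return float(w_a @ w_b)

r_low = mutual_learning(eta=0.5)    # below eta_c: a positive overlap is expected
r_high = mutual_learning(eta=3.0)   # above eta_c: the overlap should approach -1
print(r_low, r_high)
```

With a learning rate above the critical value the two weight vectors end up
antiparallel, while a small learning rate leaves them at a positive overlap.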

3 Random walk in weight space 

We want to apply synchronization of neural networks to cryptography. In
the previous section we have seen that the weight vectors of two perceptrons
learning from each other can synchronize. The new idea is to use the common
weights $w^A = -w^B$ as a key for encryption. But two problems have to
be solved yet: (i) Can an external observer, recording the exchange of bits,
calculate the final $w^A(t)$? (ii) Does this phenomenon exist for discrete
weights? Point (i) is essential for cryptography; it will be discussed in the
following section. Point (ii) is important for practical solutions since
communication is usually based on bit sequences. It will be investigated in
the following.

Synchronization occurs for normalized weights; unnormalized ones do not
synchronize. Therefore, for discrete weights, we introduce a restriction
in the space of possible vectors and limit the components $w_i^{A/B}$ to
$2L + 1$ different values,

$$w_i^{A/B} \in \{-L, -L+1, \ldots, L-1, L\}$$    (8)

In order to obtain synchronization to a parallel, instead of an antiparallel,
state $w^A = w^B$, we modify the learning rule to:

$$\begin{aligned}
w^A(t+1) &= w^A(t) - x\, \sigma^A\, \Theta(\sigma^A \sigma^B) \\
w^B(t+1) &= w^B(t) - x\, \sigma^B\, \Theta(\sigma^A \sigma^B)
\end{aligned}$$    (9)

Now the components of the random input vector x are binary,
$x_i \in \{+1, -1\}$. If the two networks produce an identical output bit
$\sigma^A = \sigma^B$, then their weights move one step in the direction of
$-x_i \sigma^A$. But the weights should remain in the interval (8); therefore,
if any component moves out of this interval, $|w_i| = L + 1$, it is set back
to the boundary $w_i = \pm L$.

Each component of the weight vectors performs a kind of random walk
with reflecting boundary. Two corresponding components $w_i^A$ and $w_i^B$
receive the same random number $\pm 1$. After each hit at the boundary the
distance $|w_i^A - w_i^B|$ is reduced until it has reached zero. For two
perceptrons with an N-dimensional weight space we have two ensembles of N
random walks on the interval $\{-L, \ldots, L\}$. If we neglect the global
signal $\sigma^A = \sigma^B$ as well as the bias $\sigma^A$, we expect that
after some characteristic time scale $\tau = O(L^2)$ the probability of two
random walks being in different states decreases as

$$P(t) \sim P(0)\, e^{-t/\tau}$$    (10)

Hence the total synchronization time should be given by $N \cdot P(t) \sim 1$,
which gives

$$t_{sync} \sim \tau \ln N$$    (11)
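
The random walk picture can be illustrated with a small experiment. This is
a sketch that, like the argument above, ignores the global signal and the
bias: two walkers on $\{-L, \ldots, L\}$ receive the same $\pm 1$ steps and
are set back to the boundary when they would leave the interval. The function
name, the number of trials and the chosen values of L are assumptions.

```python
import random

def coalescence_time(L, rng):
    """Steps until two bounded random walks driven by the same +-1 steps meet."""
    a = rng.randint(-L, L)
    b = rng.randint(-L, L)
    t = 0
    while a != b:
        step = rng.choice((-1, 1))
        a = max(-L, min(L, a + step))   # a walker leaving the interval is set back to +-L
        b = max(-L, min(L, b + step))
        t += 1
    return t

rng = random.Random(1)
trials = 500
mean_time = {L: sum(coalescence_time(L, rng) for _ in range(trials)) / trials
             for L in (2, 4, 8)}
print(mean_time)
```

The distance between the two walkers can only shrink at the boundary, so the
mean meeting time grows quickly with L, in line with a time scale that
increases with the size of the interval.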

In fact, our simulations for N = 100 show that two perceptrons with L = 3
synchronize in about 100 time steps and that the synchronization time
increases logarithmically with N. However, our simulations also showed that
an opponent recording the sequence of $(\sigma^A, \sigma^B, x)_t$ is able to
synchronize, too. Therefore, a single perceptron does not allow the
generation of a secret key.
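
A minimal sketch of two discrete-weight perceptrons following the modified
learning rule of this section is given below. It is not the authors'
simulation code: the convention sign(0) = +1, the seed and the step limit are
assumptions, and the comparison of the two weight vectors is done by the
simulation itself rather than by the partners.

```python
import numpy as np

def sync_discrete(n=100, L=3, seed=0, max_steps=100_000):
    """Count training steps until two discrete-weight perceptrons share identical weights."""
    rng = np.random.default_rng(seed)
    w_a = rng.integers(-L, L + 1, size=n)
    w_b = rng.integers(-L, L + 1, size=n)
    for t in range(max_steps):
        if np.array_equal(w_a, w_b):
            return t
        x = rng.choice([-1, 1], size=n)            # common binary input vector
        s_a = 1 if w_a @ x >= 0 else -1            # assumption: sign(0) is taken as +1
        s_b = 1 if w_b @ x >= 0 else -1
        if s_a == s_b:                             # move only when the output bits agree
            w_a = np.clip(w_a - x * s_a, -L, L)    # components leaving the interval are set back to +-L
            w_b = np.clip(w_b - x * s_b, -L, L)
    return None                                    # not synchronized within max_steps

t_sync = sync_discrete()
print(t_sync)
```

Since both networks receive identical steps whenever they move, the
componentwise distance between the weight vectors can only decrease, which is
the mechanism described above.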



4 Secret key generation 

Obviously, a single perceptron transmits too much information. An opponent
who knows the set of input/output pairs can derive the weights of the two
partners after synchronization. Therefore, one has to hide so much
information that the opponent cannot calculate the weights, but on the other
hand one has to transmit enough information that the two partners can
synchronize.

In fact, we found that multilayer networks with hidden units may be
candidates for such a task. More precisely, we consider parity machines
with three hidden units as shown in Fig. 3. Each hidden unit is a perceptron
(1) with discrete weights (8). The output bit $\tau$ of the total network is
the product of the three bits of the hidden units

$$\begin{aligned}
\tau^A &= \sigma_1^A\, \sigma_2^A\, \sigma_3^A \\
\tau^B &= \sigma_1^B\, \sigma_2^B\, \sigma_3^B
\end{aligned}$$    (12)

Fig. 3. Parity machine with three hidden units.



At each training step the two machines A and B receive identical input
vectors $x_1, x_2, x_3$. The training algorithm is the following: Only if the
two output bits are identical, $\tau^A = \tau^B$, the weights can be changed.
In this case, only the hidden unit $\sigma_i$ which is identical to $\tau$
changes its weights using the Hebbian rule

$$w_i^A(t+1) = w_i^A(t) - x_i\, \tau^A$$    (13)

For example, if $\tau^A = \tau^B = 1$ there are four possible configurations
of the hidden units in each network:

(+1,+1,+1), (+1,-1,-1), (-1,+1,-1), (-1,-1,+1)

In the first case, all three weight vectors $w_1, w_2, w_3$ are changed; in
the other three cases only one weight vector is changed. The partner as well
as any opponent does not know which of the weight vectors is updated.
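
One training step of a single parity machine can be sketched as follows. The
helper names `hidden_bits` and `train_step`, the convention sign(0) = +1 and
the clipping of the weights to $[-L, L]$ inside the update are assumptions
made for this illustration.

```python
import numpy as np

K, N, L = 3, 100, 3   # hidden units, inputs per hidden unit, weight bound

def hidden_bits(w, x):
    """Hidden-unit outputs sigma_i = sign(w_i . x_i); sign(0) is taken as +1 (assumption)."""
    return np.where(np.sum(w * x, axis=1) >= 0, 1, -1)

def train_step(w, x, sigma, tau):
    """Apply Eq. (13) to every hidden unit whose bit equals the output bit tau."""
    w = w.copy()
    active = sigma == tau                                   # units that agree with tau
    w[active] = np.clip(w[active] - x[active] * tau, -L, L)
    return w

rng = np.random.default_rng(0)
w = rng.integers(-L, L + 1, size=(K, N))   # weights of one machine
x = rng.choice([-1, 1], size=(K, N))       # input vectors x_1, x_2, x_3
sigma = hidden_bits(w, x)
tau = int(np.prod(sigma))                  # output bit, Eq. (12)
w_new = train_step(w, x, sigma, tau)       # to be applied only when tau_A == tau_B
```

For three hidden units, either one or all three of them agree with the output
bit, so an observer of $\tau$ alone cannot tell which weight vectors moved.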

The partners A and B react to their mutual stop and move signals $\tau^A$
and $\tau^B$, whereas an opponent can only receive these signals but cannot
influence the partners with its own output bit. This is the essential
mechanism which allows synchronization but prohibits learning. Numerical as
well as analytical calculations of the dynamic process show that the partners
can synchronize in a short time whereas an opponent needs a much longer time
to lock into the partners.

This observation holds for an observer who uses the same algorithm (13)
as the two partners A and B. Note that the observer knows 1. the algorithm
of A and B, 2. the input vectors $x_1, x_2, x_3$ at each time step and 3. the
output bits $\tau^A$ and $\tau^B$ at each time step. Nevertheless, he does not
succeed in synchronizing with A and B within the communication period.

Since for each run the two partners draw random initial weights and since
the input vectors are random, one obtains a distribution of synchronization
times as shown in Fig. 4 for N = 100 and L = 3. The average value of
this distribution is shown as a function of system size N in Fig. 5. Even an
infinitely large network needs only a finite number of exchanged bits, about
400 in this case, to synchronize, in agreement with the analytical calculation
for $N \to \infty$.

Fig. 4. Distribution of synchronization time for N = 100, L = 3.

Fig. 5. Average synchronization time as a function of inverse system size.

If the communication continues after synchronization, an opponent has
a chance to lock into the moving weights of A and B. Fig. 6 shows the
distribution of the ratio between the synchronization time of A and B and the
learning time of the opponent. In our simulations, for N = 100, this ratio
never exceeded the value r = 0.1, and the average learning time is about
50000 time steps, much larger than the synchronization time. Hence, the
two partners can take their weights $w_i^A(t) = w_i^B(t)$ at a time step t
where synchronization has most probably occurred as a common secret key.
Synchronization of neural networks can be used as a key exchange protocol
over a public channel.

Fig. 6. Distribution of the ratio of synchronization time between networks
A and B to the learning time of an attacker E.

5 Conclusions 

Interacting neural networks have been calculated analytically. At each
training step two networks receive a common random input vector and learn
their mutual output bits. A new phenomenon has been observed: Synchronization
by mutual learning. If the learning rate $\eta$ is large enough, and if the
weight vectors are kept normalized, then the two networks relax to an
antiparallel orientation. Their weight vectors still move like a random walk
on a hypersphere, but each network has complete knowledge about its partner.

It has been shown how this phenomenon can be used for cryptography.
The two partners can agree on a common secret key over a public channel. An
opponent who is recording the public exchange of training examples cannot
obtain full information about the secret key used for encryption.

This works if the two partners use multilayer networks, namely parity
machines. The opponent has all the information (except the initial weight
vectors) of the two partners and uses the same algorithms. Nevertheless he
does not synchronize.

This phenomenon may be used as a key exchange protocol. The two partners
select secret initial weight vectors, agree on a public sequence of input
vectors and exchange public bits. After a few steps they have identical weight
vectors which are used as a secret encryption key. For each communication
they agree on a new secret key, without having stored any secret information
before. In contrast to number theoretical methods the networks are very fast;
essentially they are linear filters, and the complexity to generate a key of
length N scales with N (for sequential update of the weights).
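
The protocol summarized above can be sketched end to end for two parity
machines. This is an illustration under stated assumptions rather than the
implementation behind the reported results: the helper names, the convention
sign(0) = +1, the seed and the step limit are hypothetical, and the simulation
compares the two weight arrays directly, whereas real partners only see the
exchanged output bits.

```python
import numpy as np

K, N, L = 3, 100, 3   # three hidden units, N = 100, L = 3 as in the simulations of the text

def output(w, x):
    """Return the hidden bits and the output bit of a parity machine."""
    sigma = np.where(np.sum(w * x, axis=1) >= 0, 1, -1)   # assumption: sign(0) -> +1
    return sigma, int(np.prod(sigma))

def update(w, x, sigma, tau):
    """Update in place the hidden units that agree with tau, keeping weights in [-L, L]."""
    active = sigma == tau
    w[active] = np.clip(w[active] - x[active] * tau, -L, L)

def exchange_key(seed=0, max_steps=200_000):
    rng = np.random.default_rng(seed)
    w_a = rng.integers(-L, L + 1, size=(K, N))   # secret initial weights of A
    w_b = rng.integers(-L, L + 1, size=(K, N))   # secret initial weights of B
    for step in range(max_steps):
        if np.array_equal(w_a, w_b):
            return w_a, step                     # identical weights serve as the key
        x = rng.choice([-1, 1], size=(K, N))     # public input vectors
        sig_a, tau_a = output(w_a, x)
        sig_b, tau_b = output(w_b, x)            # tau_a and tau_b are exchanged publicly
        if tau_a == tau_b:
            update(w_a, x, sig_a, tau_a)
            update(w_b, x, sig_b, tau_b)
    return None, max_steps

key, steps = exchange_key()
print(steps)
```

Each step costs a number of operations proportional to the number of weights,
which reflects the linear scaling mentioned above.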

Of course, one cannot rule out that algorithms for the opponent may be
constructed which find the key in a much shorter time. In fact, ensembles of
opponents have a better chance to synchronize. In addition, one can show that,
given the information of the opponent, the key is uniquely determined, and,
given the sequence of inputs, the number of keys is huge but finite, even in
the limit $N \to \infty$ [11]. This may be good news for a possible attacker.
However, recently we have found advanced algorithms for synchronization, too.
Such variations are subjects of active research, and the future will show
whether the security of neural network cryptography can compete with number
theoretical methods.